A definition of conditional probability distribution 
with non-stochastic information 

Pier Giovanni Bissiri * and Stephen G. Walker * 
January 12, 2013 

Abstract 

The current definition of a conditional probability distribution en- 
ables one to update probabilities only on the basis of stochastic infor- 
mation. This paper provides a definition for conditional probability 
distributions with non-stochastic information. The definition is de- 
rived as a solution of a decision theoretic problem, where the informa- 
tion is connected to the outcome of interest via a loss function. We 
shall show that the Kullback-Leibler divergence plays a central role. 
Some illustrations are presented. 

1 Introduction

The theory of conditional probability distributions is a well-established math- 
ematical theory that provides a procedure to update probabilities taking 
into account new information. To motivate the new work in this paper, we 
mention that such a procedure is available only if the information which is 
used to update the probability concerns stochastic events; that is, events to 
which a probability is assigned. In other words, such information needs to 
be already included into the probability model. 

1.1 Notation



Before proceeding, we introduce the notation. Let Y be a random variable
on a probability space $(\Omega, \mathscr{F}, \mathbb{P})$, which will be the
outcome of interest, valued in a measurable space $(\mathbb{Y}, \mathscr{Y})$,
with probability distribution P. Hence, P represents the initial belief
about the outcome of Y. By I we shall denote the information obtained
about Y. If I is stochastic, then we shall represent it by a random variable
X from $(\Omega, \mathscr{F}, \mathbb{P})$ into $(\mathbb{X}, \mathscr{X})$ with
probability distribution Q, and I will be assumed to be an outcome of X.
We will denote by $P_I$ the updated P given information I.

We will let D denote the Kullback-Leibler divergence (relative entropy),
i.e.

$$D(Q_1, Q_2) = \int \log\left(\frac{dQ_1}{dQ_2}\right) dQ_1$$

for any pair $(Q_1, Q_2)$ of probability measures such that $Q_1 \ll Q_2$. More
generally, we define the g-divergence:

$$D_g(Q_1, Q_2) = \int g\left(\frac{dQ_1}{dQ_2}\right) dQ_2$$

for any pair $(Q_1, Q_2)$ of probability measures such that $Q_1 \ll Q_2$, where
g is a convex function from $(0, \infty)$ into $\mathbb{R}$ such that g(1) = 0. This class of
probability discrepancies has been introduced and studied independently by
Ali & Silvey (1966) and Csiszár (1967). The Kullback-Leibler divergence is
a particular case, which can be obtained by taking $g(x) = x \log(x)$.
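
As a small numerical illustration of the last remark, the following Python sketch evaluates both divergences for two discrete distributions and checks that the g-divergence with $g(x) = x \log(x)$ coincides with the Kullback-Leibler divergence. The function names and the two example distributions are assumptions made only for this illustration; extending g to 0 by its limit is likewise a convenience that is not part of the definition above.

```python
import math


def g_divergence(q1, q2, g):
    """D_g(Q1, Q2) = sum_i g(q1_i / q2_i) * q2_i for discrete Q1 << Q2."""
    total = 0.0
    for a, b in zip(q1, q2):
        if b == 0.0:
            if a > 0.0:
                raise ValueError("Q1 is not absolutely continuous w.r.t. Q2")
            continue  # atoms with Q2-mass zero contribute nothing
        total += g(a / b) * b
    return total


def x_log_x(x):
    # g(x) = x log(x); the value at 0 is the limit as x -> 0+ (an extension,
    # since the text defines g on the open interval (0, infinity)).
    return x * math.log(x) if x > 0.0 else 0.0


def kl_divergence(q1, q2):
    """D(Q1, Q2) = sum_i q1_i * log(q1_i / q2_i), reading 0 * log(0) as 0."""
    total = 0.0
    for a, b in zip(q1, q2):
        if a > 0.0:
            if b == 0.0:
                raise ValueError("Q1 is not absolutely continuous w.r.t. Q2")
            total += a * math.log(a / b)
    return total


# Hypothetical example distributions on three points.
q1 = [0.2, 0.5, 0.3]
q2 = [0.4, 0.4, 0.2]

print(kl_divergence(q1, q2))
print(g_divergence(q1, q2, x_log_x))
```

The two printed values agree up to floating-point rounding, because $g(q_1/q_2)\, q_2 = q_1 \log(q_1/q_2)$ term by term.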



1.2 Mathematical framework 

When the standard definition of conditional probability does not apply, for
reasons we discuss later, we present an alternative definition based on a
decision theoretic framework. When the information received is
non-stochastic, but relevant to an outcome of interest, we cannot use a
probability distribution, and so we need an alternative way to connect the
information I with the outcome of interest Y. We do so using loss functions.

The purpose of this paper is to provide a definition of a conditional
distribution of Y on the basis of I, which we shall denote by $P_I$. We take
the map from the pair (I, P) to $P_I$ to be the solution of a decision problem
based on the minimization of a cumulative loss function. This loss function
will be defined on the class of probability measures on $\mathscr{Y}$ that are
absolutely continuous with respect to P; call this class $\mathscr{P}$. Indeed, the
conditional probability should be zero on every event whose unconditional
probability is zero. Here, $\lambda \in \mathscr{P}$ will denote the action, and the best
choice, i.e. the one minimizing the loss function, will be defined as the
conditional probability distribution for Y given I. In order to properly
assess the loss function, it will be expressed as the sum (cumulative loss)
of two terms, i.e.

$$L(\lambda) = H_I(\lambda) + l(\lambda, P), \qquad (1)$$

where $l(\lambda, P)$ is a discrepancy between the probability measure $\lambda$ and P, and
$H_I(\lambda)$ is the component of the loss that takes into account the information
I. In fact, we will show that $l(\lambda, P)$ should be the Kullback-Leibler
divergence for coherence purposes. So, $P_I$ will be defined as that $\lambda$
which minimizes $L(\lambda)$.



1.3 Relation to the literature 



In the literature, definitions of conditional probability, such as Jeffrey's
rule of conditioning, are given where new information is not put in terms of
the occurrence of an event included in the model. These definitions rely on
the assumption that the information can be given in the form of a constraint
(or a combination of constraints) on the probability. Constraints considered
are of the type

$$\int_{\mathbb{Y}} g(y)\, \lambda(dy) > 0, \qquad (2)$$

where g is a measurable real function on $\mathbb{Y}$ and the strict inequality is
sometimes replaced by a non-strict one. The idea is to minimize $D(\lambda, P)$
subject to the constraint (2), which represents the information I. This problem
can be solved, i.e. $P_I$ can be obtained, using Lagrange multipliers.

Such a procedure of conditionalization is a special case of our approach.
In fact, it is equivalent to minimizing the loss function (1), taking l equal to
the Kullback-Leibler divergence and

$$H_I(\lambda) = \begin{cases} 0 & \text{if } \int_{\mathbb{Y}} g(y)\, \lambda(dy) > 0, \\ +\infty & \text{if } \int_{\mathbb{Y}} g(y)\, \lambda(dy) \le 0. \end{cases}$$

For more details about conditionalization based upon constraints on the
conditional distribution, see Van Fraassen (1981), Skyrms (1985), Domotor
(1985), Diaconis & Zabell (1982) and Shore & Johnson (1980). Our approach
is different, as we encompass potentially arbitrary information about
Y, as long as it is possible to construct a loss function $h_I(y)$ for each
outcome y of Y given I.

1.4 Motivation



The random variable Y represents an unknown quantity to which a probability
distribution has been assigned and needs to be updated on the basis
of new information I. If I coincides with an outcome of another random
variable X, then it is possible to update the unconditional distribution of
Y to the probability distribution of Y given X. However, to do this, it is
required to know all the possible alternatives of I, that is, all the outcomes
of X. Moreover, it is required to assess the joint distribution of X and Y or
the conditional distribution of X given Y. This is quite easy if, for instance,
I is known to be an outcome of some well-defined random experiment. In
many situations, however, one has seen the outcome of X and, in order to
establish an update of the distribution of Y, one needs to retrospectively
ponder and imagine a joint probability model.

This difficulty arises in different puzzles such as, for instance, Freund's
puzzle of the two aces, introduced by Freund (1965). For other puzzles about
conditional probabilities, see, for instance, Gardner (1959).

These puzzles have been widely used to discuss the concept of conditional
probability. Hutchison (1999, 2008) emphasizes that the updating
process needs to take into account the circumstances under which the truth
of I was conveyed. Also, Bar-Hillel & Falk (1982) claim that to know how
the knowledge was obtained is "a crucial ingredient to select the appropriate
model". These scholars present different views about the concept of
conditionalization, but all agree on the fact that there would not be a problem
if it were known how the information I became available, and therefore one
could build a model including I.

The concept of conditional probability distributions is certainly appropriate
as a procedure to update probabilities on the basis of any new
information that was already included in the probability model. But it can
be difficult to construct a model that considers all possible relevant
information that could become available in the future. Therefore, a problem
arises when one obtains some new and possibly unexpected information and
wants to use it to update a probability distribution. Indeed, it does not seem
appropriate to assess the probability of something which has already been
observed. Our basic assumption is that the information I can be connected
to the outcome of interest via a loss function $h_I$ defined on the set of all
possible outcomes of Y. The conditional distribution of Y given I will be
defined as the one that minimizes a cumulative loss of the form given by (1).
In this way, it is possible to update the distribution of Y, even if I is some
new unexpected information, which was not included in the probabilistic
framework. It will be shown that if instead I is the outcome of a random
variable X and there is a joint density f for (X, Y), then one can recover
as a particular case the conditional distribution of Y given X. To do this, l
is taken to be the Kullback-Leibler divergence. It will be proved that, in
general, for the updating procedure to be coherent it is necessary that l is
the Kullback-Leibler divergence.



1.5 Description of the paper 

Section 2 contains the main results. In Section 3 some examples will be
considered. One such is as follows: assume that Y is a scalar quantity and
one learns that Y is close to zero. An answer will be given to this question:
how could one update the distribution of Y after learning such information?
Section 4 contains a discussion.



2 Defining conditional probability distributions with
non-stochastic information

This section reports the current definition of conditional probability distri- 
bution and presents and motivates our definition for conditional probability 
distribution with non-stochastic information. 



2.1 The current definition 

In probability theory, a conditional distribution of Y given X is a map p
from $\mathscr{Y} \times \mathbb{X}$ into $\mathbb{R}$ such that:

• for each x in $\mathbb{X}$, $p(\cdot, x)$ is a probability measure on $\mathscr{Y}$;

• for each B in $\mathscr{Y}$, $p(B, X(\omega))$ is a version of the conditional probability
$\mathbb{P}(Y \in B \mid X(\omega))$, i.e. for each A in $\mathscr{X}$ and each B in $\mathscr{Y}$,

$$\mathbb{P}\{X \in A,\, Y \in B\} = \int_A p(B, x)\, dQ(x), \qquad (3)$$

where Q denotes the probability distribution of X.
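
To make condition (3) concrete, the following Python sketch checks it exhaustively in a finite setting, where the integral over A reduces to a sum. The joint probabilities and all names are hypothetical and chosen only for illustration; the sketch assumes every value of X has positive probability, so that p(B, x) is defined everywhere.

```python
from itertools import chain, combinations

# Hypothetical joint probabilities P{X = x, Y = y} on a 2 x 3 grid.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.30, (1, 1): 0.05, (1, 2): 0.25,
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Q is the distribution of X.
Q = {x: sum(joint[(x, y)] for y in ys) for x in xs}


def p(B, x):
    """Conditional probability of {Y in B} given X = x."""
    return sum(joint[(x, y)] for y in B) / Q[x]


def subsets(items):
    """All subsets of a finite collection, as tuples."""
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )


def max_violation():
    """Largest gap between the two sides of (3) over all events A and B."""
    worst = 0.0
    for A in subsets(xs):
        for B in subsets(ys):
            lhs = sum(joint[(x, y)] for x in A for y in B)
            rhs = sum(p(B, x) * Q[x] for x in A)
            worst = max(worst, abs(lhs - rhs))
    return worst


print(max_violation())
```

The printed gap is zero up to rounding, which is all that (3) requires in this discrete case.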

The conditional distribution is known to be essentially unique, i.e. unique
only up to a.s. equality. This is a consequence of X being stochastic. In
fact, as Feller (1971, page 160) points out, if, for instance, the distribution
of X is concentrated on a subset $\mathbb{X}_0$ of $\mathbb{X}$, no natural definition of p(B, x)
is possible for x outside $\mathbb{X}_0$. Nevertheless, in individual cases, there usually
exists a natural choice dictated by regularity requirements.

Moreover, it is well known that conditional distributions do not always
exist unless some conditions are satisfied by the spaces $(\mathbb{X}, \mathscr{X})$ and
$(\mathbb{Y}, \mathscr{Y})$. For more information about conditional probability distributions,
see, for instance, Feller (1971) or Billingsley (1995).



This paper will consider the case in which there are two $\sigma$-finite measures
$\mu$ and $\nu$, on $\mathscr{X}$ and $\mathscr{Y}$ respectively, such that the probability
distribution of (X, Y) is absolutely continuous with respect to $\mu \times \nu$.
Denote its density by f. This is a general framework which includes most
applications and makes it easy to find an expression for the conditional
distributions. Generally, $\mathbb{X}$ and $\mathbb{Y}$ are subsets of $\mathbb{R}^k$ for some k, and
$\mu$ and $\nu$ are the corresponding Lebesgue measures.

If f is the density of the probability distribution of (X, Y) with respect
to $\mu \times \nu$, then one can take

$$p(B, x) = \frac{\int_B f(x, y)\, \nu(dy)}{\int_{\mathbb{Y}} f(x, y)\, \nu(dy)} \qquad (4)$$

for every B in $\mathscr{Y}$ and every x in $\mathbb{X}$ such that

$$0 < f_X(x) := \int_{\mathbb{Y}} f(x, y)\, \nu(dy) < \infty. \qquad (5)$$

Note that $p(\cdot, x)$ is absolutely continuous w.r.t. $\nu$ and its density is

$$f_{Y|X}(y \mid x) := f(x, y) / f_X(x), \qquad (6)$$

for every x in $\mathbb{X}$ satisfying (5). The density (6), which is called the
conditional density of Y given X, is what is used in most applications to find
an expression for the conditional distribution. Therefore, (6) deserves to be
considered as the "practical definition" of conditional probability
distribution. Indeed, it is the natural version of the conditional distribution
of Y given X whenever a joint density f exists for X and Y.



2.2 The loss function 

Given that it is not always possible to relate new information I to Y through
probability models, we will instead rely on the use of loss functions to
"connect" the information I to Y. We will deal with the theory first, and
then present some examples.

Before proceeding, let us recall that $p(B, \cdot)$ satisfying (3) can be seen as
the solution of a minimization problem whenever Y is in $L^2(\Omega, \mathscr{F}, \mathbb{P})$, by
resorting to the theory of Hilbert spaces (see, for instance, Jacod & Protter
2003). Clearly, this approach relies on the joint distribution of X and Y
and therefore is not available when X is replaced by some non-stochastic
information I.

So, our aim is to define a conditional probability distribution as a solution
of a decision problem with a fully motivated loss function, connecting the
action, i.e. the conditional distribution, with the current and given pieces of
information: namely the probability distribution P of Y and I, respectively.

The form of the loss function we consider is (1). In particular, $H_I(\lambda)$
will be taken in the integral form, i.e. the average or expected loss

$$H_I(\lambda) = \int_{\mathbb{Y}} h_I(y)\, \lambda(dy),$$

where $h_I(\cdot)$ is a loss function defined on $\mathbb{Y}$. It is more natural to assess
the loss in terms of the outcomes of Y, and therefore it is reasonable to
assume that $h_I(y)$ can be constructed. Examples will be considered later.
Since $\lambda$ represents beliefs about Y, it is appropriate to consider the
expected loss here. Therefore, to define conditional distributions, a
cumulative loss of the following form will be used:

$$\int_{\mathbb{Y}} h_I(y)\, \lambda(dy) + l(\lambda, P). \qquad (7)$$

This general cumulative loss represents the loss to the decision maker if
they select the probability measure $\lambda$ in the presence of the information
I and P.



2.3 Stochastic information 

Let us see how this works when indeed I is equivalent to a random variable
X and there is a joint density f for (X, Y). In this setting, the conditional
distribution (4) arises as the solution of a decision theoretic problem. To see
this, for every x in $\mathbb{X}$ satisfying (5), define the following loss function $L_x$:

$$L_x(\lambda) := -\int_S \log\big(f(x, y) / f_Y(y)\big)\, \lambda(dy) + D(\lambda, P), \qquad (8)$$

where

$$f_Y(y) := \int_{\mathbb{X}} f(x, y)\, \mu(dx),$$

S is the set of all y in $\mathbb{Y}$ such that $0 < f_Y(y) < \infty$, P is the probability
distribution of Y, $\lambda$ is a probability measure on $\mathscr{Y}$ absolutely continuous
w.r.t. P, and D is the Kullback-Leibler divergence. The loss (8) is of the
form (7) with $l(\lambda, P) = D(\lambda, P)$ and

$$h_I(y) = h(y, x) := -\mathbf{1}_S(y) \log\big(f(x, y) / f_Y(y)\big), \qquad (9)$$

where $\mathbf{1}_S(y)$ is equal to 1 or 0 depending on whether y belongs to S or not.

For every x in $\mathbb{X}$ satisfying (5), the conditional distribution given
by (4) minimizes the loss $L_x$, since

$$L_x(\lambda) = D(\lambda, p(\cdot, x)) - \log f_X(x).$$

In the loss (8), the first addendum depends on the joint density function
of X and Y and therefore, to be able to define such a loss, X needs to be
stochastic. In other words, a probability distribution has to be assigned to
X.

The loss (9) is known as the self-information loss function and is the most
commonly used when x has come from a specified family of densities. So, $H_I$
turns out to be the expected or average loss, using the self-information
loss function $-\log f_{X|Y}(x \mid y)$.
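
The claim that the usual conditional distribution minimizes the loss (8) can be checked numerically in a finite setting. The Python sketch below is only an illustration under stated assumptions: the joint probabilities, the observed value `x_obs` and all function names are hypothetical, counting measures play the role of the dominating measures, and every joint probability is taken to be positive so that S is the whole outcome space.

```python
import math
import random

# Hypothetical joint pmf f(x, y); all entries positive, so S is all of ys.
f = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.30, (1, 1): 0.05, (1, 2): 0.25,
}
ys = [0, 1, 2]
x_obs = 1  # the observed outcome of X

f_X = sum(f[(x_obs, y)] for y in ys)            # marginal of X at x_obs
f_Y = {y: f[(0, y)] + f[(1, y)] for y in ys}    # marginal of Y, i.e. P
conditional = [f[(x_obs, y)] / f_X for y in ys]  # usual conditional pmf


def kl(lam, ref):
    """Kullback-Leibler divergence between two pmfs on ys."""
    return sum(a * math.log(a / b) for a, b in zip(lam, ref) if a > 0.0)


def loss(lam):
    """Discrete analogue of the loss L_x in (8)."""
    prior = [f_Y[y] for y in ys]
    fit = -sum(l * math.log(f[(x_obs, y)] / f_Y[y]) for l, y in zip(lam, ys))
    return fit + kl(lam, prior)


def random_pmf(rng):
    """A randomly drawn candidate action lambda."""
    w = [rng.uniform(0.05, 1.0) for _ in ys]
    total = sum(w)
    return [wi / total for wi in w]


rng = random.Random(0)
candidates = [random_pmf(rng) for _ in range(100)]
print(loss(conditional), -math.log(f_X))
print(min(loss(lam) for lam in candidates) >= loss(conditional))
```

In this example the loss at the conditional pmf equals $-\log f_X(x)$, and no randomly drawn candidate does better, in line with the decomposition of $L_x$ into a Kullback-Leibler term plus a constant.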


2.4 Non-stochastic information

If the random variable X is replaced by some non-stochastic information I,
then the self-information loss (9) cannot be defined, but one can still resort
to a loss function of the form (7), assessing $h_I(y)$ in a different way. As
before, $h_I(y)$ evaluates the additional loss associated with outcome y due to
the acquisition of I. Some examples of this will be considered later.

In the loss (8), the Kullback-Leibler divergence from the marginal of Y
can also be replaced by a more general discrepancy, such as the g-divergence.
This leads us to consider a more general loss function than (8), as follows:

$$\int_{\mathbb{Y}} h_I(y)\, \lambda(dy) + D_g(\lambda, P), \qquad (10)$$

where $h_I$ is assessed after learning I, information which does not need to be
stochastic. Like the loss (8), the loss (10) is defined on the class of probability
measures on $\mathscr{Y}$ that are absolutely continuous with respect to P, which is
reasonable. Assume there is a unique probability measure that minimizes
(10) in the class of probability measures on $\mathscr{Y}$ absolutely continuous with
respect to P. Then, it will be called the conditional distribution of Y given
the information I (according to the discrepancy $D_g$ and the loss $h_I$) and it
will be denoted by $P_I$.

At this stage, assume that another piece J of information is available in
addition to I and that I and J are not overlapping pieces of information.
This happens, for instance, in the stochastic case when I and J are outcomes
of two independent random variables. We shall write IJ (or equivalently JI)
to denote the information obtained by combining I with J. Since I and J do
not overlap, we choose $h_I$, $h_J$ and $h_{IJ}$ satisfying the following additivity
property:

$$h_{IJ}(y) = h_I(y) + h_J(y). \qquad (11)$$

Clearly, updating the distribution P on the basis of I and J and updating
the conditional distribution $P_I$ on the basis of J only, should yield the same
probability distribution for Y. In the first case, the updated probability
distribution is obtained by minimizing the loss:

$$\int_{\mathbb{Y}} h_{IJ}(y)\, \lambda(dy) + D_g(\lambda, P). \qquad (12)$$

In the second one, the loss to minimize is:

$$\int_{\mathbb{Y}} h_J(y)\, \lambda(dy) + D_g(\lambda, P_I). \qquad (13)$$

The two losses (12) and (13) should yield the same updated probability
distribution for Y.

For this coherence condition to be in force, it is necessary that the
discrepancy $D_g$ is the Kullback-Leibler divergence. To be more precise, the
following theorem can be stated:

Theorem. Let $\tilde{P} := P_I$, and assume that (11) holds and

$$P_{IJ} = \tilde{P}_J, \qquad (14)$$

for every probability measure P on $\mathscr{Y}$ and for every choice of the loss
functions $h_I$ and $h_J$ such that $P_I$, $P_{IJ}$ and $\tilde{P}_J$ are all properly defined.
Then $D_g$ is the Kullback-Leibler divergence.



Proof. This result is proven from a different starting point in Bissiri & Walker
(2012, Theorem 2.5). Here, a shorter proof is given by assuming the
differentiability of g.

Assume that $\mathbb{Y}$ contains at least two distinct points, say $y_0$ and $y_1$.
Otherwise, P is degenerate and the thesis is trivially satisfied.

To prove this theorem, it is sufficient to consider a very specific choice
for P, taking $P = p_0 \delta_{y_0} + (1 - p_0) \delta_{y_1}$, where $0 < p_0 < 1$. Any probability
measure $\lambda \ll P$ has to be equal to $p\, \delta_{y_0} + (1 - p)\, \delta_{y_1}$, for some $0 \le p \le 1$.
Therefore, in this specific situation, the loss (10) becomes:

$$l(p, p_0, h_I) := p\, h_I(y_0) + (1 - p)\, h_I(y_1) + p_0\, g\left(\frac{p}{p_0}\right) + (1 - p_0)\, g\left(\frac{1 - p}{1 - p_0}\right).$$

Denote by $p_1$ the probability $P_I(\{y_0\})$, i.e. the minimum point of $l(p, p_0, h_I)$
as a function of p, and by $p_2$ the probability $P_{IJ}(\{y_0\})$. By hypothesis, $p_2$ is
the unique minimum point of both loss functions $l(p, p_1, h_J)$ and $l(p, p_0, h_{IJ})$.
Again by hypothesis, we shall consider only those functions $h_I$ and $h_J$ such
that each one of the functions $l(p, p_0, h_I)$, $l(p, p_1, h_J)$, and $l(p, p_0, h_{IJ})$, as a
function of p, has a unique minimum point, which is $p_1$ for the first one and $p_2$
for the second and third one. The values $p_1$ and $p_2$ have to be strictly bigger
than zero and strictly smaller than one: this was proved by Bissiri & Walker
(2010, Lemma 2). Hence, $p_1$ has to be a stationary point of $l(p, p_0, h_I)$ and
$p_2$ of both the functions $l(p, p_1, h_J)$ and $l(p, p_0, h_{IJ})$. Therefore,

$$g'\left(\frac{p_1}{p_0}\right) - g'\left(\frac{1 - p_1}{1 - p_0}\right) = h_I(y_1) - h_I(y_0), \qquad (15)$$

$$g'\left(\frac{p_2}{p_0}\right) - g'\left(\frac{1 - p_2}{1 - p_0}\right) = h_{IJ}(y_1) - h_{IJ}(y_0), \qquad (16)$$

$$g'\left(\frac{p_2}{p_1}\right) - g'\left(\frac{1 - p_2}{1 - p_1}\right) = h_J(y_1) - h_J(y_0). \qquad (17)$$



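Recall that $h_{IJ} = h_J + h_I$ by (11). Therefore, summing up term by term
(15) and (17), and considering (16), one obtains:

$$g'\left(\frac{p_2}{p_0}\right) - g'\left(\frac{1 - p_2}{1 - p_0}\right) = g'\left(\frac{p_1}{p_0}\right) - g'\left(\frac{1 - p_1}{1 - p_0}\right) + g'\left(\frac{p_2}{p_1}\right) - g'\left(\frac{1 - p_2}{1 - p_1}\right). \qquad (18)$$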



Recall that by hypothesis (15)-(17) need to hold for every two functions
$h_I$ and $h_J$ arbitrarily chosen with the only requirement that $p_1$ and $p_2$
uniquely exist. Hence, (18) needs to hold for every $(p_0, p_1, p_2)$ in $(0, 1)^3$. By
substituting $t = p_0$, $x = p_1 / p_0$ and $y = p_2 / p_1$, (18) becomes

$$g'(xy) - g'\left(\frac{1 - txy}{1 - t}\right) = g'(x) - g'\left(\frac{1 - tx}{1 - t}\right) + g'(y) - g'\left(\frac{1 - txy}{1 - tx}\right), \qquad (19)$$

which holds for every $0 < t < 1$, and every $x, y > 0$ such that $x < 1/t$ and
$y < 1/(xt)$. Being g convex and differentiable, its derivative $g'$ is continuous.
Therefore, letting t go to zero, (19) implies that

$$g'(xy) = g'(x) + g'(y) - g'(1) \qquad (20)$$

holds true for every $x, y > 0$. Define the function $\varphi(\cdot) = g'(\cdot) - g'(1)$. This
function is continuous, since $g'$ is, and by (20), $\varphi(xy) = \varphi(x) + \varphi(y)$
holds for every $x, y > 0$. Hence, $\varphi(\cdot)$ is $k \ln(\cdot)$ for some k, and therefore

$$g'(x) = k \ln(x) + g'(1), \qquad (21)$$

where $k = (g'(2) - g'(1)) / \ln(2)$. Being g convex, $g'$ is non-decreasing and
therefore $k \ge 0$. If $k = 0$, then $g'$ is constant, which is impossible; otherwise,
for any $h_I$, $p_1$ satisfying (15) either would not exist or would not be unique.
Therefore, k must be positive. Being $g(1) = 0$ by assumption, (21) implies
that $g(x) = k x \ln(x) + (g'(1) - k)(x - 1)$. Hence,

$$D_g(Q_1, Q_2) = k \int \ln\left(\frac{dQ_1}{dQ_2}\right) dQ_1$$

holds true for some $k > 0$ and for every pair of measures $(Q_1, Q_2)$ such
that $Q_1 \ll Q_2$.

□
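
The role of the stationarity conditions (15)-(17) can be seen numerically in the two-point setting of the proof. The Python sketch below is an illustration only: the solver, its name, the tolerance and the numerical values are assumptions, and $g(x) = (x - 1)^2$ is used merely as one example of a convex g with g(1) = 0 that is not of the form $k x \ln(x)$ plus a linear term. The quantity `d` stands for the difference $h(y_1) - h(y_0)$ on the right-hand side of (15).

```python
import math


def updated_mass(p0, d, g_prime, tol=1e-13):
    """Solve g'(p/p0) - g'((1-p)/(1-p0)) = d for p in (0, 1) by bisection.

    The left-hand side is non-decreasing in p because g is convex, so a
    sign change brackets the stationary point of the two-point loss.
    """
    def gap(p):
        return g_prime(p / p0) - g_prime((1.0 - p) / (1.0 - p0)) - d

    lo, hi = 1e-12, 1.0 - 1e-12
    if gap(lo) > 0.0 or gap(hi) < 0.0:
        raise ValueError("no stationary point inside (0, 1)")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def kl_prime(x):
    # derivative of g(x) = x ln(x)
    return math.log(x) + 1.0


def quad_prime(x):
    # derivative of g(x) = (x - 1)^2, a convex g with g(1) = 0
    return 2.0 * (x - 1.0)


def joint_and_sequential(p0, d_I, d_J, g_prime):
    """Mass on y_0 after one joint update and after two successive updates."""
    joint = updated_mass(p0, d_I + d_J, g_prime)
    p1 = updated_mass(p0, d_I, g_prime)
    sequential = updated_mass(p1, d_J, g_prime)
    return joint, sequential


print(joint_and_sequential(0.5, 1.0, 1.0, kl_prime))
print(joint_and_sequential(0.5, 1.0, 1.0, quad_prime))
```

With the Kullback-Leibler choice the joint and sequential updates agree, whereas with the quadratic g they differ (0.75 against roughly 0.742 for these hypothetical inputs), which is the kind of incoherence the theorem rules out.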

In virtue of this theorem, the conditional distribution of Y given the
information I is coherent only if it minimizes the loss

$$L(\lambda) := \int_{\mathbb{Y}} h_I(y)\, \lambda(dy) + k \int \ln\left(\frac{d\lambda}{dP}\right) d\lambda, \qquad (22)$$

where k is some positive constant. To define the loss (22), one needs to
assess $h_I$ and k. Notice that a probability distribution that minimizes $L(\lambda)$,
or equivalently $L(\lambda)/k$, is uniquely identified by $h_I / k$. In other words,
assessing $h_I = h_0$ and $k = k_0$ is equivalent to assessing $h_I = h_0 / k_0$ and $k = 1$.
For this reason, from now on, it will be convenient to fix $k = 1$.

In what follows, only coherent conditional distributions will be considered.
Therefore, $D_g$ will always be assessed to be the Kullback-Leibler
divergence. Whenever a probability measure that minimizes (22) (with $k = 1$)
exists and is unique, it will be called the conditional probability distribution
of Y given I and will be denoted by $P_I$.

If

$$\int_{\mathbb{Y}} e^{-h_I(y)}\, P(dy) < \infty, \qquad (23)$$

then $P_I$ is properly defined and is equal to

$$P_I(A) = \frac{\int_A e^{-h_I(y)}\, P(dy)}{\int_{\mathbb{Y}} e^{-h_I(y)}\, P(dy)} \qquad (24)$$

for every measurable subset A of $\mathbb{Y}$. In fact,

$$L(\lambda) = D(\lambda, P_I) - \ln\left(\int_{\mathbb{Y}} e^{-h_I(y)}\, P(dy)\right)$$

holds true for every probability measure $\lambda$ on $\mathscr{Y}$ such that $\lambda \ll P$.

By (24), it is clear that the choice of the Kullback-Leibler divergence for
$D_g$ and of a loss $h_I$ satisfying (23) is sufficient for the coherence condition
(14). Moreover, notice that $P_I$ is defined to be a unique probability measure,
not just essentially unique.
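
The closed form (24), the decomposition of the loss, and the coherence condition (14) can all be checked on a finite outcome space. The following Python sketch does so; the prior, the two loss vectors and every function name are hypothetical choices made for illustration, and only finite losses are considered.

```python
import math
import random


def kl(lam, ref):
    """Kullback-Leibler divergence between two pmfs on the same finite set."""
    return sum(a * math.log(a / b) for a, b in zip(lam, ref) if a > 0.0)


def gibbs_update(prior, h):
    """Return (P_I, Z): P_I is proportional to prior * exp(-h), Z normalises."""
    weights = [p * math.exp(-hy) for p, hy in zip(prior, h)]
    z = sum(weights)
    return [w / z for w in weights], z


def cumulative_loss(lam, prior, h):
    """Discrete version of the loss (22) with k = 1, for finite h."""
    return sum(hy * l for hy, l in zip(h, lam)) + kl(lam, prior)


# Hypothetical prior and losses on a four-point outcome space.
prior = [0.1, 0.2, 0.3, 0.4]
h_I = [0.0, 1.0, 0.5, 2.0]
h_J = [0.3, 0.0, 1.2, 0.1]

P_I, Z = gibbs_update(prior, h_I)

# Coherence: update with I then J, or with the summed loss in one step.
sequential, _ = gibbs_update(P_I, h_J)
joint, _ = gibbs_update(prior, [a + b for a, b in zip(h_I, h_J)])

# Decomposition of the loss for a randomly drawn candidate action.
rng = random.Random(1)
w = [rng.uniform(0.05, 1.0) for _ in prior]
lam = [wi / sum(w) for wi in w]
print(cumulative_loss(lam, prior, h_I), kl(lam, P_I) - math.log(Z))
print(max(abs(a - b) for a, b in zip(sequential, joint)))
```

The first line prints the same number twice (up to rounding), and the second prints a value at rounding level, since multiplying by $e^{-h_I}$ and then by $e^{-h_J}$ is the same as multiplying by $e^{-(h_I + h_J)}$.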



3 Illustrations 

The loss function $h_I$ is chosen by the decision-maker on the basis of the
available information. Such information sometimes happens to be stochastic,
i.e. to belong to a set of outcomes to which a probability is assigned. If this is
the case, one should update the probability distribution of Y by means of the
usual conditional distribution. Whenever there is a joint density f for X and
Y, this is tantamount to using the self-information loss function
$h(y, x) = -\ln f_{X|Y}(x \mid y)$. If the available information is not stochastic, then
one can resort to the approach described in the present paper, properly
assessing the loss function $h_I$. To see a practical and simple example,
consider the situation mentioned in the Introduction:

Example 1. Y is a scalar quantity and the information I is that Y is close
to zero. In this case, it is natural to assess:

$$h_I(y) = w y^2,$$

where w is some positive constant, and the conditional distribution of Y
given I is

$$P_I(A) = \frac{\int_A e^{-w y^2}\, P(dy)}{\int_{\mathbb{Y}} e^{-w y^2}\, P(dy)}.$$
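
As a worked instance of Example 1, suppose (this is an assumption made only for illustration, not a choice made in the text) that P is a normal distribution with mean m and variance $s^2$. Completing the square in $-w y^2 - (y - m)^2 / (2 s^2)$ shows that the updated distribution is again normal, with mean $m / (1 + 2 w s^2)$ and variance $s^2 / (1 + 2 w s^2)$. The Python sketch below compares this closed form with a direct numerical evaluation of the ratio of integrals on a grid; the function names, grid bounds and numerical values are hypothetical.

```python
import math


def update_normal(mean, var, w):
    """Closed-form update of a N(mean, var) prior under h_I(y) = w * y**2."""
    shrink = 1.0 + 2.0 * w * var
    return mean / shrink, var / shrink


def update_on_grid(mean, var, w, lo=-30.0, hi=30.0, n=60001):
    """Same update computed from unnormalised weights on a uniform grid."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    # exp(-h_I(y)) times the prior density, up to a constant factor
    weights = [
        math.exp(-w * y * y - (y - mean) ** 2 / (2.0 * var)) for y in grid
    ]
    total = sum(weights)
    m = sum(y * wt for y, wt in zip(grid, weights)) / total
    v = sum((y - m) ** 2 * wt for y, wt in zip(grid, weights)) / total
    return m, v


print(update_normal(2.0, 1.0, 0.5))
print(update_on_grid(2.0, 1.0, 0.5))
```

For these inputs the closed form gives mean 1.0 and variance 0.5: the information that Y is close to zero pulls the distribution towards the origin and tightens it, and a larger w strengthens both effects.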

Example 2. For the second example, everyone would know how to proceed, yet
there is currently no formal mathematical mechanism for carrying out the
conditional update. Suppose it becomes known that Y belongs to B, for
some set B, not because of some preliminary random experiment but rather
because the decision maker becomes aware that B is actually the set of
possible values that Y can take. So the information is non-stochastic. The
most natural choice is

$$h_I(y) = \begin{cases} 0 & y \in B, \\ +\infty & y \notin B, \end{cases}$$

from which it is easy to deduce that the $\lambda$ minimising $\int h_I(y)\, \lambda(dy) + D(\lambda, P)$
is given by

$$P_I(A) = \int_{A \cap B} P(dy) / P(B).$$
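
Example 2 can be reproduced on a finite outcome space by plugging the zero/infinity loss into the exponential re-weighting of the prior. In the Python sketch below the prior, the set B and the helper names are hypothetical; the sketch relies on the fact that `math.exp(-math.inf)` evaluates to 0.0, so outcomes outside B receive zero weight.

```python
import math

# Hypothetical prior on four outcomes and the set B of values found possible.
prior = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
B = {2, 4}


def h(y):
    """Loss that is 0 on B and +infinity off B."""
    return 0.0 if y in B else math.inf


weights = {y: p * math.exp(-h(y)) for y, p in prior.items()}
z = sum(weights.values())  # equals P(B)
posterior = {y: w / z for y, w in weights.items()}


def restricted(A):
    """P(A intersect B) / P(B), the usual restriction of the prior to B."""
    p_B = sum(prior[y] for y in B)
    return sum(prior[y] for y in A if y in B) / p_B


A = {1, 2}
print(sum(posterior[y] for y in A), restricted(A))
```

Both printed numbers equal one third here, matching the formula $P_I(A) = P(A \cap B) / P(B)$.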
J AnB 

This example is relevant to a number of so-called paradoxes whereby it
becomes apparent to the decision maker that the outcome space is smaller
than the support of P (e.g. Freund's paradox of the two aces). How this
is learnt is crucial. This has been pointed out by Hutchison (1999, 2008).
If the information that Y belongs to B is based on some preliminary random
experiment, for which a probability model is given, then obviously the
unconditional distribution of Y can be updated by resorting to the current
definition of conditional probability. If not, there is currently no rigorous
justification for the use of the conditional probability. The present
paper provides a formal and broad enough framework to cover this case.
Many philosophers of science, who are mentioned in the Introduction, have
discovered paradoxes based on such scenarios.

Example 3. To conclude, let us consider a simple and very concrete example.
Consider a horse race in which six horses participate. In order to decide
how to bet, one assesses the probability for each horse to win. Denote by
$p_j$ the probability that horse number j wins, for $j \in \{1, \ldots, 6\}$. In this
example, Y is the number corresponding to the horse that will win.

Before the race begins, it starts raining. Since conditions have changed,
the probabilities need to be updated. It is problematic to pursue this aim
by resorting to the current definition of conditional probability. In fact, this
requires knowing the joint probability that it rains and that horse number
j wins. As an alternative, one could calculate the conditional probabilities
of victory for each horse by applying Bayes' theorem, which requires the
probability that it rains given the victory of horse j. But it is raining and
the race is not yet run!

It is therefore appropriate to resort to the definition of a conditional
probability distribution given in this paper. To this aim, one can assess a
score to evaluate the disadvantage due to the rain for each horse. Denote
by $h_j$ the score referred to horse number j. If the ability of horse j
is unaffected by the rain, then $h_j = 0$. If not, $h_j$ is positive. A higher score
will be given to those horses whose ability to run is more affected. In this
way, one can set

$$h_I(y) = \sum_{j=1}^{6} h_j \mathbf{1}_{\{j\}}(y),$$

where I is the information that it is raining and $I_0$ is the initial information
about the horses and the weather. The updated probability that the j-th
horse wins turns out to be

$$\frac{p_j e^{-h_j}}{\sum_{i=1}^{6} p_i e^{-h_i}}$$

for $j = 1, \ldots, 6$.
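
A minimal Python sketch of Example 3 follows. The initial probabilities and the rain scores are invented for illustration and are not taken from the text; the update itself is the exponential re-weighting of the prior by the scores, normalised to sum to one.

```python
import math

# Hypothetical initial win probabilities for horses 1..6 (they sum to 1).
p = [0.30, 0.25, 0.20, 0.10, 0.10, 0.05]

# Hypothetical rain scores: 0 means unaffected, larger means more hindered.
h = [0.0, 0.5, 0.0, 1.0, 0.2, 0.0]


def update(prior, scores):
    """Re-weight each prior probability by exp(-score) and renormalise."""
    weights = [pj * math.exp(-hj) for pj, hj in zip(prior, scores)]
    total = sum(weights)
    return [wj / total for wj in weights]


updated = update(p, h)
for j, (before, after) in enumerate(zip(p, updated), start=1):
    print(f"horse {j}: {before:.3f} -> {after:.3f}")
```

Horses with score zero keep their relative odds among themselves and gain probability overall, while the horses penalised by the rain lose probability in proportion to $e^{-h_j}$.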

4 Discussion 

We have established a framework in which we can update probabilities in 
the light of general, i.e. non-stochastic, information. Given that we cannot 
connect the information and the outcome of interest via a probability model, 
we do so through a loss function. Minimizing a cumulative loss function
involving the information on one side and the probability distribution on the
other yields the updated probability distribution. When the information is
stochastic, we employ the self-information loss function; the solution then
reverts to the standard definition of conditional probability.